Abstract

While the use of cluster features has become ubiquitous in core NLP tasks,
most cluster features in NLP are based on distributional similarity. We
propose a new type of clustering criterion, specific to the task of
part-of-speech tagging. Instead of distributional similarity, these clusters
are based on the behavior of a baseline tagger when applied to a large corpus.
These cluster features provide gains in accuracy similar to those achieved by
clusters derived from distributional similarity. Using both types of cluster
features together further improves tagging accuracy. We show that the method
is effective in both in-domain and out-of-domain scenarios for English, and
for French, German and Italian. The effect is larger for out-of-domain text.

1 Introduction 

The limited amounts of annotated training data available for supervised
learning call for semi-supervised learning approaches, which aim to leverage
the vast amounts of readily available unannotated data in order to improve the
accuracies of supervised systems.

In natural-language processing, a simple and popular method for
semi-supervised learning is based on word clustering (Miller et al., 2004;
Koo et al., 2008; Turian et al., 2010): words in a large corpus are clustered
into equivalence classes based on some (usually distributional) criterion, and
the induced classes are then used as additional features in a supervised
learning model. The use of cluster-based features has been shown to improve
accuracy on many NLP tasks, including parsing (Koo et al., 2008; Candito and
Crabbé, 2009), named-entity recognition (Miller et al., 2004; Turian et al.,
2010; Lin and Wu, 2009; Chrupała, 2011), classification of semantic relations
(Chrupała, 2011) and machine translation (Uszkoreit and Brants, 2008).

Word clusters are usually induced based on a distributional-similarity
criterion: words are clustered based on the words that tend to occur before or
after them. Clusters produced by the Brown clustering algorithm (Brown et al.,
1992) are an example of commonly used distributional clustering features. In
this model, words are clustered by means of a probabilistic cluster-based
language model. A more scalable distributional clustering algorithm is
introduced by Uszkoreit and Brants (2008), who use a parallel implementation
of the Exchange algorithm to cluster words based on word-to-cluster
transitions. When used as features, clusters derived using the Exchange
algorithm are as effective as those derived by the Brown algorithm. Other
types of distributional clustering algorithms rely on word embeddings
(Collobert and Weston, 2008; Mnih and Hinton, 2007). Turian et al. (2010) find
that embedding-based distributional clusters tend to underperform Brown-type
clusters.

The distributional-similarity hypothesis underlying all these algorithms is
that words in similar contexts behave in a similar manner. The notion of
similarity is vague and is not specific to any particular task. In practice,
distributional-similarity based clusters show a mix of semantic and syntactic
properties. Can we design word clusters capturing properties which are
relevant for a specific task?



We focus on the task of part-of-speech (POS) tagging, and present a novel
task-specific clustering criterion: words are clustered based on the behavior
of a baseline tagger when applied to a large body of text.

One of the most useful sources of information for tagging a given word $w$
with a tag $t$ is the weighted ambiguity class of the word, as represented by
the conditional tagging distribution $p(T = t \mid w)$. Our first kind of
cluster aims to capture exactly this information: we cluster words based on
their empirical $p(T = t \mid w)$ distributions, as observed over a large
automatically tagged corpus.

The tag of a word $w$ can also be predicted, to some extent, by the previous
word $w_{-1}$ or the following one $w_{+1}$. We can create word clusters to
capture these sources of information by clustering words based on the
empirical distributions $p(T_{+1} = t \mid w)$ and $p(T_{-1} = t \mid w)$, and
using these clusters to represent the previous and following word
respectively.

The approach is related to self-training in that we use the tagger's own
predictions in order to improve it. However, in contrast to self-training, we
use statistics derived from the tagger's output as additional features for
supervised training.

For POS tagging, our task-specific clusters are as effective as those derived
by a lexical distributional-similarity criterion when used on their own, and
have a cumulative effect when both kinds of clusters are used together.
Moreover, the task-specific clusters serve as a very good proxy for word
identity for the purpose of POS tagging, and we can train completely
unlexicalized POS-tagging models without sacrificing accuracy.

2 Method 

Our training protocol is as follows: 

1) Train a supervised tagger on POS-annotated text.

2) Use the tagger to annotate large amounts of raw text.

3) Collect $(W, T)$ counts from the automatically tagged text.

4) For each word $w$ occurring more than $k$ times, compute:

$$p(T = t \mid w) = \frac{\mathrm{count}(w, t)}{\sum_{t'} \mathrm{count}(w, t')}$$

5) Cluster words based on $p(T = t \mid w)$.

We then use the derived clusters as additional features in a
discriminatively trained sequence tagger.
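Steps 3 and 4 of the protocol can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `tag_distributions`, the toy corpus, and the reading of "more than k times" as a strict inequality are all assumptions.

```python
from collections import Counter, defaultdict

def tag_distributions(tagged_tokens, tagset, k=100):
    """Return {word: [p(T = t | w) for t in tagset]} for words seen more than k times."""
    counts = defaultdict(Counter)            # counts[w][t] = count(w, t)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1

    dists = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())     # sum over t' of count(w, t')
        if total > k:                        # assumed strict frequency threshold
            dists[word] = [tag_counts[t] / total for t in tagset]
    return dists

# Toy automatically tagged corpus (hypothetical data, tiny threshold).
corpus = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("run", "VB"), ("the", "DT")]
dists = tag_distributions(corpus, tagset=["DT", "NN", "VB"], k=2)
print(dists)  # {'run': [0.0, 0.25, 0.75]}
```

Each resulting list is one row of the matrix that is later clustered, with one dimension per tag.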



When clustering, we encode the conditional $p(T = t \mid w)$ distribution for
each word $w$ as a $|T|$-dimensional vector in which the $i$th entry is the
conditional probability $p(T = t_i \mid w)$, and cluster words based on the
Euclidean distance between their vectors:

$$\mathrm{dist}(w_1, w_2) = \sqrt{\sum_{i=1}^{|T|} \bigl(p(t_i \mid w_1) - p(t_i \mid w_2)\bigr)^2}$$

We use the K-means clustering algorithm with the initialization procedure
described by Arthur and Vassilvitskii (2007), which stochastically favors
cluster centers that are far from previously chosen centers.
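The clustering step can be sketched in a few lines of pure Python. This is an illustrative toy, assuming small in-memory inputs; the names `kmeans` and `kmeans_pp_init`, the fixed seed, and the handling of empty clusters are assumptions, and a vocabulary of the size reported in Table 2 would call for a vectorized or distributed implementation.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans_pp_init(vectors, n_clusters, rng):
    """Seeding in the style of Arthur and Vassilvitskii (2007): each new center
    is sampled with probability proportional to its squared distance from the
    nearest center chosen so far."""
    centers = [list(rng.choice(vectors))]
    while len(centers) < n_clusters:
        weights = [min(sq_dist(v, c) for c in centers) for v in vectors]
        if sum(weights) == 0:                # every point coincides with a center
            centers.append(list(rng.choice(vectors)))
        else:
            centers.append(list(rng.choices(vectors, weights=weights)[0]))
    return centers

def kmeans(vectors, n_clusters, n_iter=100, seed=0):
    rng = random.Random(seed)
    centers = kmeans_pp_init(vectors, n_clusters, rng)

    def nearest(v):
        return min(range(n_clusters), key=lambda c: sq_dist(v, centers[c]))

    for _ in range(n_iter):
        labels = [nearest(v) for v in vectors]
        for c in range(n_clusters):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:                      # leave an empty cluster where it is
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(v) for v in vectors], centers

# Four toy tag distributions: two noun-like words and two verb-like words.
vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels, centers = kmeans(vectors, n_clusters=2)
print(labels)
```

The cluster id assigned to each word is what the tagger later sees as a feature; the numeric ids themselves carry no meaning.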

Words provide weak signals regarding the POS tag of the next or previous word.
We produce clusters based on the distributions $p(T_{-1} = t \mid w)$ and
$p(T_{+1} = t \mid w)$ in a similar fashion.
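The neighboring-tag statistics can be collected in the same pass over the tagged text. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and skipping the missing neighbor at sentence boundaries is a choice made here, not something the text specifies.

```python
from collections import Counter, defaultdict

def neighbor_tag_counts(tagged_sentences):
    """Count, for each word w, the tags of the preceding and following tokens."""
    prev_tag = defaultdict(Counter)   # prev_tag[w][t] = count(T_{-1} = t, w)
    next_tag = defaultdict(Counter)   # next_tag[w][t] = count(T_{+1} = t, w)
    for sentence in tagged_sentences:
        for i, (word, _) in enumerate(sentence):
            if i > 0:                              # no previous tag at sentence start
                prev_tag[word][sentence[i - 1][1]] += 1
            if i + 1 < len(sentence):              # no next tag at sentence end
                next_tag[word][sentence[i + 1][1]] += 1
    return prev_tag, next_tag

def normalize(tag_counts, tagset):
    """Turn raw counts into a probability vector ordered by tagset."""
    total = sum(tag_counts.values())
    return [tag_counts[t] / total for t in tagset]

# Toy tagged sentences (hypothetical data).
sentences = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
             [("the", "DT"), ("old", "JJ"), ("dog", "NN")]]
prev_tag, next_tag = neighbor_tag_counts(sentences)
tagset = ["DT", "JJ", "NN", "VBZ"]
print(normalize(next_tag["the"], tagset))  # [0.0, 0.5, 0.5, 0.0]
print(normalize(prev_tag["dog"], tagset))  # [0.5, 0.5, 0.0, 0.0]
```

The two resulting vector sets are clustered separately, giving one cluster inventory per distribution.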

3 Details and Experiments 

3.1 Parameters 

In all the experiments, we set the word frequency threshold $k$ to 100.
Because our unannotated corpus is large, the vocabulary remains very large
even after applying this threshold (see Table 2). We run the K-means algorithm
for 100 iterations, and cluster the words into 256 classes. While the
baseline-tagger features are tuned for good accuracy, we performed only
minimal tuning of the extended cluster features, and did not tune any of the
other parameters.

3.2 Tagger 

We use a first-order linear-chain sequence tagger,¹ trained using the averaged
structured-MIRA algorithm. The features include distributional clusters
derived from the unannotated corpora using the Exchange algorithm, and are
detailed in Table 1. Throughout the presentation, all features are assumed to
be conjoined with the tag to be predicted.

¹ Most previous work on POS tagging, e.g. (Ratnaparkhi, 1996; Brants, 2000;
Collins, 2002; Toutanova et al., 2003), uses at least a second-order model to
obtain its best results. In contrast, we use a first-order model, which is
much faster. Thus, our tagging results are lower than those reported in
previous work evaluating on the WSJ corpus (our train/test split is also
somewhat different). Our primary interest in this work is not in demonstrating
state-of-the-art tagging accuracies on the WSJ corpus, but rather in examining
the contributions of different cluster features to tagger accuracy on diverse
corpora.



Type                     Templates

Lexical                  $w_0$
Signature                pref(1)  pref(2)  pref(3)
                         suf(1)  suf(2)  suf(3)
                         capitalization  hyphen
Transition               $t_{-1}$
$\rho$ Cluster-Dist      $\rho_0$  $\rho_{-1}$  $\rho_{-2}$  $(\rho_{-1}, \rho_0)$  $(\rho_{-2}, \rho_{-1})$
  +Transition            $(\rho_0, t_{-1})$  $(\rho_{-1}, t_{-1})$

Table 1: Tagger features for the baseline tagger. pref(n) and suf(n) are
prefixes/suffixes of length n of the current word $w_0$. The
distributional-similarity features $\rho$ are derived using the algorithm of
Uszkoreit and Brants (2008).
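To make the templates of Table 1 concrete, the sketch below instantiates them for one sentence position and conjoins each feature with a candidate tag. It is a hypothetical illustration: the feature-string format, the `UNK` and `PAD` placeholders, the reading of "capitalization" as an initial uppercase letter, and the cluster offsets 0, -1 and -2 are assumptions, not details taken from the tagger used in the experiments.

```python
def baseline_features(words, clusters, i, prev_tag):
    """Unconjoined feature strings for position i of a sentence.

    `clusters` maps a word to its distributional cluster id; "UNK" and "PAD"
    are hypothetical placeholders for unclustered words and sentence edges.
    """
    def rho(j):
        if not 0 <= j < len(words):
            return "PAD"
        return clusters.get(words[j], "UNK")

    w = words[i]
    feats = [f"w0={w}", f"t-1={prev_tag}"]                    # lexical, transition
    feats += [f"pref({n})={w[:n]}" for n in (1, 2, 3)]        # prefixes
    feats += [f"suf({n})={w[-n:]}" for n in (1, 2, 3)]        # suffixes
    feats += [f"cap={w[0].isupper()}", f"hyphen={'-' in w}"]  # signature
    feats += [f"rho0={rho(i)}", f"rho-1={rho(i - 1)}", f"rho-2={rho(i - 2)}",
              f"rho-1,rho0={rho(i - 1)},{rho(i)}",
              f"rho-2,rho-1={rho(i - 2)},{rho(i - 1)}",
              f"rho0,t-1={rho(i)},{prev_tag}",
              f"rho-1,t-1={rho(i - 1)},{prev_tag}"]
    return feats

def conjoin(feats, tag):
    """Conjoin every feature with the tag to be predicted."""
    return [f"{f}&t={tag}" for f in feats]

words = ["The", "well-known", "dog"]
clusters = {"The": 17, "well-known": 4, "dog": 9}   # hypothetical cluster ids
feats = conjoin(baseline_features(words, clusters, 1, "DT"), "JJ")
print(feats[:3])
```

In a linear model each conjoined string indexes one weight, so the same surface feature contributes a separate score for every candidate tag.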

3.3 Datasets 

Annotated data  For English, we use the following annotated corpora:

WSJ  The WSJ portion of the Penn Treebank corpus (Marcus et al., 1993) is used
to train all of our English tagging models. We train on Sections 2-21, and
evaluate on Section 22.

Brown (BRN)  The entire Brown corpus portion of the Penn Treebank is used for
evaluation.

Questions (QTB)  The QuestionBank (Judge et al., 2006) contains 4,000
questions, which we use for evaluation.

Football (FTBL)  We report results on the development set (185 sentences) of
the Football corpus of Foster (2010). In one experiment we use the test
section (170 sentences) as additional training data.

Web  The entire web portion of the Ontonotes corpus (Weischedel et al., 2011)
is used for evaluation.²

In most experiments we train our tagger on the training set of the WSJ corpus
and reserve the other datasets for evaluation. The baseline tagger is always
trained on the WSJ training set.

German  We use data and splits from the CoNLL 2006 shared task (Buchholz and
Marsi, 2006).

French  We use the French Treebank (Abeillé and Barrier, 2004) with splits
defined in Candito et al. (2010).

² Note that the Ontonotes corpus is systematically different from the training
corpus in several aspects, including using both the "IN" and "TO" tags for the
word "to" depending on its usage (in the Penn Treebank, "to" is consistently
tagged as TO), and the introduction of additional tags for hyphens and
non-sentence-final punctuation. While one could get vastly improved accuracies
on this dataset by specifically addressing these issues, we did not do so in
the current work, as our primary interest is comparing the effect of the
various cluster features on tagging accuracy.



Language   Domain   #Tokens              Vocabulary

English    News     $19 \times 10^9$     649K
German     News     $2.5 \times 10^9$    386K
French     News     $1.4 \times 10^9$    165K
Italian    News     $0.5 \times 10^9$    116K

Table 2: Details of unannotated data. Vocabulary is the number of word types
appearing more than 100 times.

Italian We use data and splits from the CoNLL 2007 
shared task (Nivre et al., 2007). 

Unannotated Data  We use one year of newswire articles from multiple sources
from a news aggregation website for each language. The datasets range in size
from 0.5 to 19 billion tokens. The unannotated data is summarized in Table 2.

4 Results 
4.1 English 

In the first set of experiments we test the effectiveness of the task-based
clustering method on both in-domain and out-of-domain English data.

We begin by isolating the amount of information captured by the different
clusters. To this end, we train models with the simplest set of features
possible: for each sequence position we consider the lexical item $w_0$, the
transition feature $t_{-1}$, and zero or more cluster features. We also train
models including the cluster features but not the lexical items. We evaluate
the models on the different English datasets. Table 3 details the results.



Features                        WSJ     QTB     BRN     FTBL    Web

$t_{-1}$  $w_0$                 94.97   85.93   91.29   89.79   88.77
$t_{-1}$  $w_0$  $\rho_0$       96.01   88.28   94.11   91.69   91.13
$t_{-1}$  $w_0$  $\zeta_0$      96.42   89.74   94.95   92.85   92.17
$t_{-1}$  $w_0$  $\eta_{-1}$    95.21   86.07   91.49   89.58   89.10
$t_{-1}$  $w_0$  $\tau_{+1}$    95.30   86.34   91.64   89.73   89.04
$t_{-1}$  $\rho_0$              94.37   85.66   92.17   89.49   88.97
$t_{-1}$  $\zeta_0$             96.14   89.24   94.74   93.17   92.06

Table 3: Tagging accuracies with minimal feature sets. $\rho$: distributional
clusters, $\zeta$: $p(t \mid w)$-clusters, $\eta$: $p(t_{+1} \mid w)$-clusters,
$\tau$: $p(t_{-1} \mid w)$-clusters. All models are trained on the training
portion of the WSJ corpus.



Note that this is not exactly a domain-adaptation scenario, as all the
unannotated data is from the newswire domain. Still, the cluster features
contribute to tagging accuracies across all the datasets.

When the current word $w_0$ is present as a feature, the distributional
clusters $\rho_0$ are somewhat less informative than the task-specific
clusters $\zeta_0$, which are based on $p(t \mid w)$. The cluster features of
the previous and next words ($\eta_{-1}$ and $\tau_{+1}$) are, as expected,
less informative than the cluster associated with the current word, but still
contain some predictive information. When we exclude the word from the feature
set and rely only on the cluster information (the last two rows of the table),
the task-specific clusters $\zeta_0$ do particularly well, compensating almost
completely for the missing word identity information. The models relying
solely on the previous tag and the task-specific cluster ($t_{-1}$ $\zeta_0$)
are significantly better than the models relying on the previous tag and the
explicit word identity ($t_{-1}$ $w_0$).

We then proceed to evaluate the effectiveness of 
the cluster features in the context of a richer feature 
set. We use the feature-sets described in Tables 1 
and 4. We consider different subsets of the cluster 
features. Results are presented in Table 5. 



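Type                                   Templates

$\zeta$ Cluster $p(t \mid w)$          $\zeta_0$  $\zeta_{-1}$  $\zeta_{-2}$  $(\zeta_{-1}, \zeta_0)$  $(\zeta_{-2}, \zeta_{-1})$
  +Transition                          $(\zeta_0, t_{-1})$  $(\zeta_{-1}, t_{-1})$
$\eta$ Cluster $p(t_{+1} \mid w)$      $\eta_{-1}$
  +Transition                          $(\eta_{-1}, t_{-1})$
$\tau$ Cluster $p(t_{-1} \mid w)$      $\tau_{+1}$

Table 4: Additional cluster features.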





        No         Dist      Task       Dist+Task      All                      All
        clusters   $\rho$    $\zeta$    $\rho\zeta$    $\rho\zeta\eta\tau$      (no $w$)

WSJ     96.35      96.90     96.82      97.01          97.02                    97.02
QTB     88.86      90.74     90.50      90.83          90.93                    90.93
BRN     94.37      95.57     95.48      95.68          95.72                    95.70
FTBL    91.96      93.38     93.44      93.74          93.80                    94.03
Web     91.38      92.81     92.82      92.99          93.05                    93.05

Table 5: Tagging accuracies using the different clusterings within a rich
feature set. All models are trained on the training portion of the WSJ corpus.
The last column contains all the features except the lexical one.



As expected, using the richer feature sets improves results for all models.
The cluster features still contribute to tagging accuracies across all the
datasets. The contribution of the task-based clusters (Task) is similar to
that of the distributional clusters (Dist): slightly lower on WSJ, QTB and
BRN, and marginally higher on FTBL and Web. Results improve when the two
clustering approaches are combined (Dist+Task). Adding the task-based
clustering of neighboring words (All) further improves the results on most
datasets. The largest improvements are observed on the out-of-domain datasets.
Somewhat surprisingly, dropping the explicit lexical feature $w_0$ (last
column) does not hurt performance, and even significantly improves it on the
Football dataset.
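One way to compare gains across datasets with different baseline accuracies is relative error reduction. The sketch below applies it to the "No clusters" and "All" columns of Table 5; the metric and the helper name `relative_error_reduction` are additions for illustration and are not part of the reported evaluation.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed, given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err

# (no clusters, all cluster features) accuracies copied from Table 5.
table5 = {
    "WSJ": (96.35, 97.02),
    "QTB": (88.86, 90.93),
    "BRN": (94.37, 95.72),
    "FTBL": (91.96, 93.80),
    "Web": (91.38, 93.05),
}
for name, (base, full) in table5.items():
    print(f"{name}: {100 * relative_error_reduction(base, full):.1f}% fewer errors")
```

Absolute differences in accuracy points remain the numbers actually reported in the tables; this view only rescales them by the headroom each dataset leaves.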

4.2 English - Additional Training Data 

In the next experiment, we target the situation in which we have a small
amount of annotated data in a domain of interest, in addition to the larger
amount of out-of-domain data. We use the test set of the Football dataset as
additional in-domain training material. Results are presented in Table 6. As
expected, using the additional in-domain training data improves the results.
However, the contribution of the additional data is small, as most of the gap
is already covered by the cluster features. When using all the cluster
features but no lexicalization (last column), training on WSJ alone
outperforms the joint training.



Train    No         Dist      Task       Dist+Task      All                      All
         clusters   $\rho$    $\zeta$    $\rho\zeta$    $\rho\zeta\eta\tau$      (no $w$)

WSJ      91.96      93.38     93.44      93.74          93.80                    94.03
+FTBL    92.31      93.47     93.50      93.92          93.94                    93.83

Table 6: Adaptation results to the Football domain when training on both
datasets.

4.3 German, French and Italian 

We observe similar trends on languages other than English (Table 7). The
additional task-specific cluster features improve performance across all
languages.



Language   No Clusters   Dist      Task       Dist+Task      All
                         $\rho$    $\zeta$    $\rho\zeta$

German     96.48         97.68     97.70      97.84          98.00
French     96.45         97.55     97.54      97.66          97.74
Italian    93.58         96.00     96.15      96.43          96.39

Table 7: Tagging results for German, French and Italian.

5 Conclusions 



We presented a task-specific word clustering method for POS tagging. The
method is effective across domains and languages. The automatically derived
clusters capture the essence of the lexical items with respect to the task, to
the extent that the cluster features can replace the actual lexical items. We
would like to see task-specific clusterings for other, more challenging tasks.